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ABSTRACT 

In this paper we present new ideas to accelerate the compu- 
tation of the eigenvector of the transition matrix associated 
to the PageRank algorithm. New ideas are based on the de- 
composition of the matrix-vector product that can be seen 
as a fluid diffusion model, associated to new algebraic equa- 
tions. We show through experimentations on synthetic data 
and on real data-sets how much this approach can improve 
the computation efficiency. 

Categories and Subject Descriptors 

G.2.2 [Discrete Mathematics]: Graph Theory — Graph 
algorithms; F.2.2 [Analysis of algorithms and prob- 
lem complexity]: Nonnumerical Algorithms and Prob- 
lems — Sorting and searching; If. 3. 3 [Information storage 
and retrieval]: Information Search and Retrieval — rele- 
vance feedback, search process 

General Terms 

Algorithms, Experimentation 

Keywords 

Computation, Ranking, Web graph, Markov chain. Eigen- 
vector. 



.^H. 1. INTRODUCTION 



In this paper, we are interested by the computation is- 
sue of the solution of the PageRank equation. PageRank 
is a link analysis algorithm that been initially introduced 
by [24] and used by the Google Internet search engine, that 
assigns a numerical value to each element of a hyper-linked 
set of nodes, such as the World Wide Web. The algorithm 
may be applied to any collection of entities (nodes) that 
are linked through directional relationships. The numerical 
value that it assigns to each node is called the PageRank of 
the node and as we will see below the rank value of the node 
is associated to a eigenvector problem. 

The complexity of the problem for computing the eigen- 
vector of a matrix increases rapidly with the dimension of 
the vector space. Efficient, accurate methods to compute 
eigenvalues and eigenvectors of arbitrary matrices are in 
general a difficult problem (cf. power iteration |26) . QR 
algorithm [TTlIia)- 

In the particular case of PageRank equation, several spe- 
cific solutions were proposed and analyzed [2UI [5] including 
power method [24] with adaptation [15] or extrapolation [121 



1161 [7], iterative aggregation/disaggregation method [211 1131 
I23| . adaptive on-line method [1], etc. 

The approach proposed here is an improvement idea par- 
tially inspired from the algorithm proposed in [1]. We also 
proposed an algebraic proof of Lemma 2.3 in [T]. This ap- 
proach can be also compared to the Gauss-Seidel iteration 
(cf. [25]): the Gauss-Seidel iteration is known to be faster 
than the Jacobi iteration (cf. [2]). However the approach 
can not be distributed because of the constraint in the order 
of the iteration ([TT]). In our approach, the computation of 
each iteration uses the elements of the previous integrating 
the last update per component as with Gauss-Seidel method, 
as opposed to the power iteration (or to Jacobi method) 
where the computation of the whole vector is based on the 
previous vector, without taking into account the update (or 
partial update) at vector entry level. 

As we will show below, our approach has the advantage of 
being iteration order independent and can be very naturally 
deployed using an asynchronous distributed computation. 
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Figure 1: Intuition: collection vs diffusion. 

Finally, if one can associate the Gauss-Seidel method to 
an operation of collection (one entry of vector is updated 
based on the previous vector based on the incoming links), 
our approach consists in an operation of diffusion (the fluid 
diffusion from one entry of the vector consists in updating 
all children nodes following the outgoing links) (cf. Figure 

m- 

The model is introduced in Section [2] Some theoretical 
results are presented in Section [3] Finally we compared 
the computation cost of our approach to existing methods 
through several data-sets in Section [4| 

2. MODEL 
2.1 Notations 



We consider a non-negative matrix P of size N x N such 
that each column sums up to one. In particular, we can 
associate such a matrix to a Markov chain on a state of size 
A'^, where P would be the transition matrix with Pij — p{i, j) 
the transition probability from state j to state i. In the 
following, we will also call i as a node (from web graph and 
PageRank context). 

The fact that each column of P sums up to one means 
that in the Markov chain, each state j has a positive prob- 
ability to jump to at least one state i (if such a condition 
is not true, we complete the matrix P by replacing the zero 
columns with 1/7V (by the personalization vector V in a gen- 
eral case) to get P. This corresponds to the dangling nodes 
completion [20], we will see below that such an adaptation 
is not required in our approach). 

In this paper, we consider the iteration of equation of the 
form: 



A.Xr, 



(1) 



where A is a matrix of size N x N which can be explicitly 
decomposed as: 

A = dP-H(l-d)l/.l* (2) 

where V is a normalized vector of size N and 1* is the column 
vector with all components equal to one. The equation ^ 
is an explicit Doeblin's decomposition [9l[l0]. So we have: 

= dPX„ + {1- d)V. 

We assume that the stationary probability of A is defined 
by the vector X (its existence and uniqueness is straightfor- 
ward e.g. from the contraction property of ^4). We have: 

X = dPX + {1 - d)V. 

In the context of PageRank, the vector V is a personalized 
initial condition cf. [20j . 

2.2 Main equations 

We first define the vector S„ by: 

n 

Sn = (l-d)^d'''P'=V 
fc=0 

So that: 

oo 

Sex, = {l-d)J2d^P''V = {1- d){Id-dP)-W = X 

fc=0 

The equation Sn has been used in [B] to define formulae 
for PageRank derivatives of any order. 

We define a matrix with all entries equal to zero except 
for the k-th diagonal term: {Jk)kk = 1. 

In the following, we assume given a deterministic or ran- 
dom sequence / = {ii, 12, in, ■■■} with i„ £ {1, .., A''}. We 
only require that the number of occurrence of each value 
k £ {1, .., A'^} in / to be infinity. 

Then, we define the fluid vector F associated to / by: 

= {l-d)V (3) 
= dPJ,^F„-i + JkFn-i (4) 

= {Id - + dPJ^JF„-^ (5) 
where Id is the identity matrix. 



Fo 
F„ 



We deflne the history vector H by: 

n 



(6) 



By definition, H„ is an increasing function (all compo- 
nents are positive). 

The above fiuid and history vector is associated to the 
following algorithm (ALGO-REF): 

Initialization: 
H[i] := 0; 
F[i] := (l-d)*v_i; 
R := 1-d; 

k := 1; 

While ( R/(l-d) > Target_Error ) 
Choose i_k; 
sent : = F [i_k] ; 
F[i_k] := 0; 
H[i_k] += sent; 
For all child node j of i_k: 

F[j] += sent*p(j ,i_k)*d; 
R -= sent* (1-d) ; 
k++; 

Remark 1. If d = 1, the above algorithm is equal to the 
one defined in J^. The role of d is here important: in ^Tj, the 
stopping condition is not defined, because in their algorithm, 
what they called cash ( our fluid F) is not decreasing but con- 
stant. Also, because in our algorithm, the total fluid tends 
to zero, we can predict and control precisely the convergence 
to the limit, as we will see below. 

3. THEORETICAL RESULTS 
3.1 Main equations 



Theorem 1. We have the equality: 

H„ + F„ = Fo + dPH„ 



(7) 



Proof. The proof is straightforward by induction: as- 
suming the equation ((Tjl true for n: 

Hn + l -\- F„ + l = Hn + Ji^^iFn+ ^ ] Jk Fn + dP Ji^^^^ Fn 

kl^in + l 

= Hr, + F„+dPJ,^^,F„ 
= Fo + dPH„ + dPJ,^^,F„ 
= Fo + dPH„+i 

Note that the equation ([Tj is true for d = 1 with Fo replaced 
by any initial vector. 

Theorem 2. We have the equality: 

oc 

Soo-H„ = Yd''P''F,, = {I -dPy^Fr, (8) 



and as a direct consequence, we have: 



|F„ 



1-d 



(9) 



where \ ■ \ is the Li norm. 



Proof. 

Soo = dPS^ + Fo 
Using the equation (O, we have: 

S^-Hn = dPSo.+Fo-{Fo + PH„)+Fr, 

= dP{Soo - H„) + F„. 

By iteration, we get ((8)1. The equation ((Ojl is obvious re- 
marking that \P''F„\ = \F„\ 

Now, we assume that the sequence / is chosen such that 
at iteration n: in = arg maxi(_Fn_i)i. Then we have the 
following result: 

Lemma 1. If in = argmaxi(_F„_i)i, we have: 

\Fn\ < \Fn-l\ (^1 - 

Proof. The proof is straightforward, noticing that we 
suppressed Ji„Fn-i and added Ji„Fn-i x d. Then using 

Ji„Fn-l > \Fn-l\/N. 

Thanks to Lemma[TJ we have an exponential convergence 
to zero of Fn- This lemma still holds if for in an entry of 
Fn-i which is larger or equal to |F„_i|/A'' is chosen. 

As a consequence of this lemma, we have: 

-ffoo = Fo+dPHaa- 

Therefore, Hoc = X and Hn is an increasing (component by 
component) function to X. 

Remark 2. From equations ((7} and ([5]), we can obtain 
the following iteration equation on H : 

Hn = {Id - .Kild - dP)) Hn-1 + - d)V. 

This formulation may be directly exploited for an efficient 
distributed computation of Hoc ■ This issue will be addressed 
in a future paper. 

If d = 1, we still have: 

Hn = {Id — Jin{Id — P)) Hn-l + Ji„Xo 
Xn = PXn-1, 

and Hn converges to X^feLo ~ X^feLo ^''Xq after appro- 
priate normalization (case of IT^), at least when all entries 
of P are non-negative ( when the spectral radius of P is one 
and if there are negative entries in P, it may converge or not 
converge): but this convergence is based on a Cesaro aver- 
aging which is known to be very slow ( cf. results in Section 

In all cases, Hn can be simply interpreted as a specific 
way to compute or to estimate the power series X^fe^o P^-^o 
(when there is convergence). 

3.2 Updating equation 

The fact that Hn converges to Soo for any arbitrary choice 
of the sequence I can be also exploited to compute more effi- 
ciently the new eigenvector in case of the graph modification 
(so of P). 

Theorem 3. Assume the initial graph associated to {P, Hoo) 
is modified to the updated graph represented by (P',//^), 
then we have: 

H'o^-Hao = il-dP')-'diP' ^P)Hoo. 



Proof. We have 

{l-dP'){H'^-Hoc) = Fo - {1 - dP')Hoc 

= Fo - (1 - dP)Hoo + d{P' ~ P)H, 

= d{P'-P)Hoo. 

In the above updating equation, (P' — P)Hao may mix 
positive and negative terms. We can apply on them the 
operator Fn separately or jointly. 

Now, assume the equation Hn has been computed up to 
the iteration no and that at that time, we are interested to 
compute the limit associated to P' (for instance, because 
the web graph has been modified/updated). 

Then very naturally, we can apply our diffusion method 
with P', but with modified initial condition Fq = Fng + 
d{P' — P)Hna for which we have: 

K + Pn = Po+dP'K. (10) 

We have the following very intuitive results: 

Theorem 4. Hn^ -\- H'^ (H'^ is the limit of the equation 
(fTOjl ) IS the solution of the equation: 

X = dP'X + Po. 

Proof. The limit of (fTO)l satisfies: 

HL = F^ + dP'H'^ 

= Fn„+d{P' ^ P)Hn„+dP'H'^. 

Combining this with Hn„ -f Fn„ ~ Fo + dPHng, we have 
obviously what we want. 

The above result implies that one can continuously update 
the iterations when P is regularly updated by just injecting 
in the system a fluid quantity equal to d{P' — P)Hn(, and 
then applying the new matrix P'. 

If a distributed computation approach was to be used, we 
just need to synchronize the time from which P' is applied. 

3.3 About dangling nodes 

In the algorithm ALGO-REF, we don't need to complete 
the null columns of the matrix P. We can simply add the 
following condition: 

While ( R/(l-d) > Target_Error ) 
Choose i_k; 
sent : = F [i_k] ; 
H[i_k] += sent; 
F[i_k] := 0; 
If ( i_k has no child ) 

R -= sent ; 
else 

For all child node j of i_k: 

F[j] += sent*p(j ,i_k)*d; 
R -= sent*(l-d) ; 
k++; 

The quantity R/{1 — d) measures exactly (thanks to The- 
orem [2]) the distance to the stationary probability. How- 
ever when P includes dangling nodes, it is easy to see that 
P/(l — d) defines only an upper bound and that Hao need 
to be renormalized (dividing by the norm Li of Hoa) to find 
the probability eigenvector satisfying the PageRank equa- 
tion with the completed matrix P. 



3.4 Asynchronous distributed computation 

The proposed algorithm is very weU suited for the asyn- 
chronous distributed computation (e.g. cf. dH). Indeed, at 
any moment of the iteration, the fluid F„ can be split in any 
number of elements and be distributed per element, the Li 
norm of each element controlling exactly the error that can 
be reduced. The most natural way is to divide per compo- 
nent {Fn)i of the vector, and an obvious particular example 
is to split the initial condition vector Xq in N/m elements 
of size m. 

4. COMPARISON RESULTS 

In the following, we will call: 

• ALGO-MAX: if / is defined by in ~ arg maxi(F„_i)i; 

• ALGO-RAND: if i„ is randomly chosen from {1, .., A''}; 

• ALGO-PER: if in = n mod N (periodic cycle); 

• MAT-ITER: if the matrix product iteration ([T]) is used; 

• OPIC: the one defined in [1] (Random version). 

The stopping condition for ALGO-REF variants are based 
on R/(l -d) > Target_Error. For MAT-ITER, we use the 
well known condition: | xd/(l— d) < Target _Err or 

For OPIC, we measured the convergence by comparing the 
distance to the precomputed limit (from MAT-ITER), since 
it does not define any stopping condition. 

Below we set a simple simulation scenario to get a first 
evaluation of our proposed solution and comparison to ex- 
isting solutions in the context of the original PageRank on 
the web graph. We don't pretend to generate any realistic 
model, for more details on the web graph the readers may 
refer to [H [TH H H [22] . 

4.1 Simulation scenario 

We set A*' the total number of nodes (URLs) to be simu- 
lated. Then we create L random links (directional) to con- 
nect a node i to j as follow: 

• the choice of the source node is done following a power- 
law: l/k°'; 

• the choice of the destination node is done following a 
power-law: 

For simplicity, we assumed no correlation between the 
number of incoming and outgoing links: for that purpose, to 
defined the distribution of the destination nodes by associat- 
ing to node k a probability proportional to 1/k" followed by 
a large number (by default A*") of permutations of randomly 
chosen pair of nodes {i,j). Then, to define the distribution 
of the destination nodes, we did the same operation. In this 
way, the order of the nodes does not introduce any bias for 
ALGO-PER. 

The tables below summarize the characteristics of the 6 
synthetic data we considered varying the number of links 
per node and the parameter a. 

4.2 Simulation results analysis 

We define the elementary step as an operation requiring 
the use of one non-zero entry of A: for instance, if A has 
L entries that are not zero, the product A.X would require 



Scenario 


L (nb links) 


D (Nb dangling nodes) 


SI 


2172 


9552 


S2 


8081 


8646 


S3 


28507 


6252 



Table 1: Parameters: a = 2.0, N = 10000 



Scenario 


L (nb links) 


Nb dangling nodes 


Sib 


12624 


7696 


S2b 


61189 


3197 


S3b 


265245 


33 


Table 2: 


Parameters: 


a = 1.5, N = 10000 



L elementary steps. We already mentioned that ALGO- 
REF variants does not require the matrix completion P. 
For MAT-ITER, we can also avoid such a completion, but 
the cost to pay is the vector renormalization. Therefore, 
we assumed here that the cost of A.X is L where L is the 
number of not zero entries of the initial matrix P. 

For ALGO-RAND and ALGO-PER, when we have no 
fiuid on the node in {Fi^ — 0), we go to the next step 
without incrementing the counter of the elementary steps. 

For ALGO-MAX, there is of course a computation cost to 
find the argmax of Fn- In order to show its full potential, 
its cost is not taken into account in the number of elemen- 
tary steps. In the comparison tests, we found that taking 
the argmax is not necessary and same improvements were 
obtained replacing the argmax by iteration process where in 
one iteration all i such that {Fn)i is above a threshold are 
all chosen, then scaling down the threshold progressively. 

Figure [2l [3] and [4] shows the results for SI, S2 and S3. In 
these scenarios, because of a, there are the proportion of the 
dangling nodes are important. We can observe in all cases 
a substantial gain with our approach. 
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Figure 2: Scenario SI. L/N = 0.22, D/N = 0.96. 

Figure O [6] and [7] shows the results for Sib, S2b and S3b. 
In these scenarios, the proportion of the dangling nodes are 
less important. We can still observe in all cases a substantial 
gain with our approach. 

4.3 Comparison on web graph 
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Figure 3: Scenario S2. L/N = 0.81, D/N = 0.86. 
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Figure 4: Scenario S3. L/N = 2.9, D/N = 0.62. 
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Figure 6: Scenario S2b. L/N = 6.1, D/N = 0.32. 
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Figure 7: Scenario S3b. L/N = 26.5, D/N = 0.003. 
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Figure 5: Scenario Sib. L/N = 1.3, D/N = 0.77. 



In this section, we realized the same comparison tests than 
above on a web graph imported from the dataset grO . California 
(available on http : //www. cs . Cornell . edu/Courses/ cs685/ 
2002f a/) and from the dataset uk-2007-05(§1000000 (avail- 
able on http://law.dsi.unimi.it/datasets.php). 

The results for the dataset grO . Calif ornia is shown in 
Figure [H] In this figure, we added ALGO-OP a variant of 
ALGO-MAX where the argmax is replaced by 

in = argmax((F„_i)i/((#mi + 1) x {#outi + 1))) 

i 

with #mi (resp. ^outi) equal to the number of incoming 
(resp. outgoing) links to (resp. from) the node i. The 
intuition of this renormalization is: 

• #out: the cost of the diffusion of the fluid F is pro- 
portional to #out; 

• #in: when there are many incoming links, it is worth 
to wait and aggregate the fluid before the fluid diffu- 
sion. 

We see that we still have a signiflcant gain and that ALGO- 
OP outperforms ALGO-MAX. The main reason for which 
the proposed approach outperforms greatly the original OPIG 
algorithm is the fact that we don't have the fluctuation due 
to the Cesaro averaging as in OPIG (which converges as 
l/sqrt{n)). 
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Figure 8: Dataset grO. California: 
9664 nodes (4637 dangling nodes). 
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Figure[9]shows the results for the dataset uk-2007-05@1000000. 
Here ALGO-MAX outperforms ALGO-OP. It shows that 
the optimization of the sequence choice I is still an open 
problem. We added here ALGO-OP2 based on: 

i„ = argmax ((_F„_i)i/(#oufi + 1)) . 



5. CONCLUSION 

In this paper, we proposed a new algorithm to optimize 
the computation cost of the eigenvector of PageRank equa- 
tion and showed the theoretical potential gain. 

We believe that we have here a promising new approach 
and we are still investigating on a practical optimal ALGO- 
OP variants solution. 

In a future work, we wish to validate the performance 
of our approach in terms of the effective run time cost for 
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Figure 9: Dataset uk-2007-05@1000000: 41247159 
links on 1000000 nodes (45766 dangling nodes). 



large data, in particular, using the asynchronous distributed 
computation approach. 
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